Take-Home Exercise 3

Putting Visual Analytics into Practical Use

Lee Xiao Qi https://example.com/norajones (School of Computing and Information Systems (SMU))https://example.com/spacelysprokets
2022-05-15

1. The Task

We are required to attempt one of the three questions under Challenge 3 of VAST Challenge 2022 on the economic of the city of Engagement, Ohio USA by using appropriate static and interactive statistical graphics methods.

With that, I have chosen #3 which is to:

Describe the health of the various employers within the city limits. What employment patterns do you observe? Do you notice any areas of particularly high or low turnover? Limit your response to 10 images and 500 words.

2. The Datasets

2.1 About the Data

About 1000 representative residents have volunteered to provide data using the city’s urban planning app, which records the places they visit, their spending, and their purchases, among other things; totaling to 3 dataset folders (Activity Logs, Attributes, Journals). For this exercise, we refers to Empolyers that offers jobs as stipulated in the Jobs dataset; we thus selected the dataset Employers, Buildings, Jobs & Participants from the Attributes folder, and CheckinJournal from the Journals folder which give details on the employers, employees and jobs.

2.2 Importing the relevant packages and data

The packages tidyverse (including dplyr, ggplot2, patchwork), ggrepel, lubridate, gapminder, gganimate, ggiraph etc will be used for the purpose of this exercise:

The code chunk below is used to install and load the required packages onto RStudio.

packages = c('tidyverse','treemap','ggrepel','lubridate','gapminder','gganimate','ggiraph','plotly','zoo')

for(p in packages){
  if(!require(p, character.only =T)){
    install.packages(p)
    }
  library(p, character.only =T)
}

The code chuck below import Employers.csv, Buildings.csv, Jobs.csv, CheckinJournal.csv and Participants.csv from the data folder into R by using read_csv() and save it as an tibble data frame.

Employers <- read_csv("data/Employers.csv")
Buildings <- read_csv("data/Buildings.csv")
Jobs <- read_csv("data/Jobs.csv")
Checkin <- read_csv("data/CheckinJournal.csv")
Participants <- read_csv("data/Participants.csv")

2.3 Data Exploration and Wrangling

First, let’s get a general sense of our data using the function summary().

summary(Employers)
   employerId     location           buildingId    
 Min.   : 379   Length:253         Min.   :   3.0  
 1st Qu.: 829   Class :character   1st Qu.: 261.0  
 Median :1279   Mode  :character   Median : 486.0  
 Mean   :1089                      Mean   : 517.8  
 3rd Qu.:1734                      3rd Qu.: 782.0  
 Max.   :1797                      Max.   :1041.0  
summary(Buildings)
   buildingId       location         buildingType      
 Min.   :   1.0   Length:1042        Length:1042       
 1st Qu.: 261.2   Class :character   Class :character  
 Median : 521.5   Mode  :character   Mode  :character  
 Mean   : 521.5                                        
 3rd Qu.: 781.8                                        
 Max.   :1042.0                                        
                                                       
  maxOccupancy       units          
 Min.   :  1.00   Length:1042       
 1st Qu.:  5.00   Class :character  
 Median :  7.00   Mode  :character  
 Mean   : 15.51                     
 3rd Qu.: 12.00                     
 Max.   :418.00                     
 NA's   :539                        
summary(Jobs)
     jobId          employerId     hourlyRate      startTime       
 Min.   :   0.0   Min.   : 379   Min.   : 10.00   Length:1328      
 1st Qu.: 331.8   1st Qu.: 438   1st Qu.: 10.03   Class1:hms       
 Median : 663.5   Median : 884   Median : 14.74   Class2:difftime  
 Mean   : 663.5   Mean   :1059   Mean   : 19.13   Mode  :numeric   
 3rd Qu.: 995.2   3rd Qu.:1337   3rd Qu.: 23.32                    
 Max.   :1327.0   Max.   :1797   Max.   :100.00                    
   endTime          daysToWork        educationRequirement
 Length:1328       Length:1328        Length:1328         
 Class1:hms        Class :character   Class :character    
 Class2:difftime   Mode  :character   Mode  :character    
 Mode  :numeric                                           
                                                          
                                                          
summary(Checkin)
 participantId      timestamp                      venueId    
 Min.   :   0.0   Min.   :2022-03-01 05:35:00   Min.   :   0  
 1st Qu.: 221.0   1st Qu.:2022-06-10 18:30:00   1st Qu.: 449  
 Median : 464.0   Median :2022-10-03 20:25:00   Median : 910  
 Mean   : 480.5   Mean   :2022-10-05 07:41:29   Mean   :1015  
 3rd Qu.: 726.0   3rd Qu.:2023-01-28 08:10:00   3rd Qu.:1358  
 Max.   :1010.0   Max.   :2023-05-25 00:05:00   Max.   :1805  
  venueType        
 Length:2100635    
 Class :character  
 Mode  :character  
                   
                   
                   

2.3.1 Joining of Datasets

Employers dataset is joined with Buildings dataset based on buildingId to filter only relevant information from Buildings dataset in regards to employers. Left join on Employers is used as commercial buildings are a subset of the different type of buildings.

Employers <- Employers %>% left_join(Buildings,by="buildingId")

Next, we use outer join on Employers and Jobs based on employerId to have a full overview on all the jobs that are offered by each employer.

Employers = merge(x=Employers,y=Jobs,by="employerId",all=TRUE)

2.3.2 Renaming of Columns and Values

First, we start by renaming the columns and values of in Employers and Checkin dataset using the function rename(), and sub() for a better format and ease of reading.

Note: A check between both datasets shows that venueId in Checkin dataset refers to the employerId, pubId etc. For the purpose of this exercise, we are only interested in the employerId (venueType = Workspace) and other venues type will be removed subsequently.

# rename columns
Employers <- Employers %>%
  rename('Employer_ID' = 'employerId', 
         'Location(Pt)' = 'location.x', 
         'Location(Area)' = 'location.y',
         'Building_ID' = 'buildingId', 
         'Building_Type' = 'buildingType', 
         'Max_Occupancy' = 'maxOccupancy', 
         'Units' = 'units', 
         'Job_ID' = 'jobId',
         'Hourly_Rate' = 'hourlyRate',
         'Start_Time' = 'startTime',
         'End_Time' = 'endTime',
         'Days_To_Work' = 'daysToWork',
         'Education_Level' = 'educationRequirement')

Checkin <- Checkin %>%
  rename('Participant_ID' = 'participantId', 
         'Timestamp' = 'timestamp', 
         'Employer_ID' = 'venueId',
         'Venue_Type' = 'venueType')

Jobs <- Jobs %>%
  rename('Education_Level' = 'educationRequirement')


Participants <- Participants %>%
  rename('Participant_ID' = 'participantId', 
         'Household_Size' = 'householdSize', 
         'Have_Kids' = 'haveKids', 
         'Age' = 'age', 
         'Education_Level' = 'educationLevel', 
         'Interest_Group' = 'interestGroup', 
         'Joviality' = 'joviality')

#rename row values
Employers$Education_Level <- sub('HighSchoolOrCollege', 
                                    'High School or College',
                                    Employers$Education_Level)
Jobs$Education_Level <- sub('HighSchoolOrCollege', 
                                    'High School or College',
                                    Jobs$Education_Level)
Participants$Education_Level <- sub('HighSchoolOrCollege', 
                                    'High School or College',
                                    Participants$Education_Level)

2.3.3 Fliter Data, Compute Percentage Change in Number of Employee Over Time

We are using Checkin to see the changes in employment (i.e. checkin by participants at workplace) over time. We see that the venueId column in the dataset are IDs of all possible venues such as work place, restaurants and pubs.

Given that we are only interested in workplace, we will first filter out row that reads “workplace” using grep().

We then assign a running week number and also day to day percentage change in number of employees for each employer.

#Extract the date from timestamp
Checkin$Date <- as.Date(Checkin$Timestamp)

#Filter rows with workplace as value
Workplace_Checkin <- Checkin[grep("Workplace", Checkin$Venue_Type),]

#Assign Running Week Number
Workplace_Checkin <- Workplace_Checkin %>%
  mutate(Week_Num = as.double(ceiling(difftime(Workplace_Checkin$Date, strptime("01.03.2022", format = "%d.%m.%Y"),units="weeks"))))

#Compute no. of employees that report to work during that week (5day work week) 
Count_Checkin <- Workplace_Checkin %>%
  group_by(Week_Num, Employer_ID) %>%
  summarise('Num_of_Employees'= n_distinct(Participant_ID)) %>%  
  ungroup()

#Calculate Percentage Change  
Count_Checkin <- Count_Checkin %>%  
  group_by(Employer_ID) %>%
  mutate(
    Perc_Chg = round((Num_of_Employees - lag(Num_of_Employees))/lag(Num_of_Employees)*100,2)
  )

3. Visualizing with Charts

Next, we put up a series of charts to address the questions.

3.1 Turnover Rate

We see the absolute and percentage change in number of employee for each employer over each work week

Count_Checkin$Employer_ID <- as.character(Count_Checkin$Employer_ID)


p<- ggplot(Count_Checkin, aes(x=Week_Num, y=Num_of_Employees, group=Employer_ID)) +
  geom_line(aes(color=Employer_ID),show.legend = FALSE)+
  ylim(1,15) +
  ggtitle(label = "Number of Employees For Each Work Week")+
  xlab("Week Number from Mar 22 to May 23") +
  ylab("No. of Employees") +
  theme_minimal()+
  theme(plot.title = element_text(size=12, face="bold",hjust = 0.5))  

ggplotly(p)

Most of the manpower movement happens on the first two weeks where alot employement dropped, noting sharp decline for Employer 1763 and 868; which started off with 14 and 15 in Week 1 and then dropped to 6 employees in Week 2.

p<- ggplot(Count_Checkin, aes(x=Week_Num, y=Perc_Chg,group=Employer_ID)) +
  geom_line(aes(color=Employer_ID), show.legend = FALSE)+
  ggtitle(label = "Percentage % Change of Employees For Each Work Week")+
  xlab("Week Number from Mar 22 to May 23") +
  ylab("% Change") +
  theme_minimal()+
  theme(plot.title = element_text(size=12, face="bold",hjust = 0.5)) 


ggplotly(p)

Similarly, most of the decline happened in the first two weeks and remained stabled mostly throughout untill the last week.

3.2 Employment Pattern - Wage vs Education vs Age Group

Next, we want to know the age profile of employees for each employer. We grouped them into various age groups for easier readability.

We use left join to find out the employer for participants and the respective age group they are in.

Participants_AgeGrp <- Participants %>%
   select(Participant_ID, Age, Education_Level)

Workplace_Checkin<-
  left_join(Workplace_Checkin,Participants_AgeGrp,by = 'Participant_ID') 

We now use raincloud plot to see the wage distribution for different education level.

Jobs$Education_Level = factor(Jobs$Education_Level, levels = c('Low', 'High School or College', 'Bachelors','Graduate'))

p<- ggplot(Jobs, aes(x = Education_Level, y = hourlyRate, fill=Education_Level)) + 
  ggdist::stat_halfeye(
    adjust = .5, 
    width = .6, 
    .width = 0, 
    justification = -.3, 
    point_colour = NA) + 
  geom_boxplot(
    width = .25, 
    outlier.shape = NA
  ) +
  geom_point(
    size = 1.3,
    alpha = .3,
    position = position_jitter(
      seed = 1, width = .1
    )
  ) + 
  coord_cartesian(xlim = c(1.2, NA), clip = "off")+
  ggtitle(label = "Wage Distribution for Different Education Level",
          subtitle = "High Wages For Higher Educated")+
  theme_minimal()+
  theme(plot.title = element_text(size=14, face="bold",hjust = 0.5),
          plot.subtitle = element_text(size=12,hjust = 0.5,color='mediumvioletred'))+
  theme(axis.title.y= element_text(angle=0), axis.ticks.x= element_blank(),
        panel.background= element_blank(), axis.line= element_line(color= 'grey'))

ggplotly(p)

We see that indeed higher education are paid better.

We use a heatmap with treepmap to see the age profile for each employer.

d3tree(tm, rootname = "Age and No. of Employees under Each Employer" )